Fix TCP bridge hang when a client stalls or dies silently #49

Open
marklynch wants to merge 3 commits into main from claude/system-crash-analysis-gt0bli

@marklynch

Owner

The client socket returned by accept() was blocking with no send timeout,
and all bus-forwarding sends used a blocking send(). If a connected client
stopped draining (network drop with no FIN/RST), the kernel send buffer
filled and send() blocked the bridge task forever. That task is also the
only UART reader, so the whole device froze with no panic, no logs and no
recovery — matching the observed weeks-stable-then-dead crash.

Make the client socket non-blocking and give send_to_client a keep/drop
policy: a full send keeps the client, a full send buffer (EAGAIN) drops the
message but keeps the connection, and a partial write or hard error drops
the client. Add TCP keepalive (idle 30s, interval 5s, count 3) so a silently
dead peer is detected and the single client slot freed. Factor the client
teardown into a shared helper.
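
The keep/drop policy described above could look roughly like the following sketch. It assumes a POSIX-style socket API and a socket that is already non-blocking. The `send_outcome_t` enum, the `classify_send` helper, and the boolean return of `send_to_client` are hypothetical names and shapes chosen for illustration; the actual signatures in this PR may differ.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical outcome names; the PR describes the policy, not these identifiers. */
typedef enum {
    SEND_KEEP,          /* whole message written: keep the client */
    SEND_DROP_MESSAGE,  /* send buffer full: discard this message, keep the client */
    SEND_DROP_CLIENT    /* partial write or hard error: tear the client down */
} send_outcome_t;

/* Pure decision step, split out so it can be checked without a socket. */
static send_outcome_t classify_send(ssize_t sent, size_t len, int err)
{
    if (sent < 0) {
        if (err == EAGAIN || err == EWOULDBLOCK) {
            return SEND_DROP_MESSAGE;
        }
        return SEND_DROP_CLIENT;
    }
    if ((size_t)sent == len) {
        return SEND_KEEP;
    }
    /* A partial write would leave the client's stream cut mid-message. */
    return SEND_DROP_CLIENT;
}

/* Returns true if the client should be kept. Assumes fd is non-blocking,
 * so send() returns immediately instead of waiting for buffer space. */
static bool send_to_client(int fd, const void *buf, size_t len)
{
    ssize_t sent = send(fd, buf, len, 0);
    int err = (sent < 0) ? errno : 0;
    return classify_send(sent, len, err) != SEND_DROP_CLIENT;
}
```

Because the call never waits for buffer space, the bridge task gets back to reading the UART no matter what state the client is in. When `send_to_client` returns false, the caller would run the shared teardown helper.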

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf

claude added 3 commits June 25, 2026 09:53
The client socket returned by accept() was blocking with no send timeout,
and all bus-forwarding sends used a blocking send(). If a connected client
stopped draining (network drop with no FIN/RST), the kernel send buffer
filled and send() blocked the bridge task forever. That task is also the
only UART reader, so the whole device froze with no panic, no logs and no
recovery — matching the observed weeks-stable-then-dead crash.

Make the client socket non-blocking and give send_to_client a keep/drop
policy: a full send keeps the client, a full send buffer (EAGAIN) drops the
message but keeps the connection, and a partial write or hard error drops
the client. Add TCP keepalive (idle 30s, interval 5s, count 3) so a silently
dead peer is detected and the single client slot freed. Factor the client
teardown into a shared helper.
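
As a minimal sketch of the socket setup this commit describes, the accepted client socket could be configured as below. The helper name `configure_client_socket` is hypothetical. The option names (`SO_KEEPALIVE`, `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are the ones available on Linux and lwIP; other stacks may spell them differently.

```c
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: make the socket non-blocking and enable keepalive
 * with the values quoted in the commit message (idle 30s, interval 5s,
 * count 3). Returns 0 on success, -1 if any step fails. */
static int configure_client_socket(int fd)
{
    int on = 1;
    int idle = 30;      /* seconds of silence before the first probe */
    int interval = 5;   /* seconds between probes */
    int count = 3;      /* unanswered probes before the peer is declared dead */

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval) < 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0) {
        return -1;
    }
    return 0;
}
```

With these values a silently dead peer should be detected after roughly 45 seconds (30 s idle plus 3 probes at 5 s intervals), at which point the next socket operation fails and the client slot can be freed.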

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf

The /status JSON already exposed free/min-free heap, but a web readout only
helps while the device is still responsive — useless once it has locked up.
Add a low-priority heap_monitor task that logs free heap, the minimum-free
watermark, and the largest free block every 5 minutes, so a slow leak or
growing fragmentation shows up as a trend in the console/TCP log history
before it exhausts memory. Largest-free-block specifically surfaces
fragmentation (e.g. from repeated MQTT discovery republishes), which total
free heap alone can hide.
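
To illustrate why the largest free block is worth logging next to total free heap, here is a portable sketch of formatting one monitor sample. The `heap_sample_t` struct, the `fragmentation_pct` metric, and the log line layout are assumptions made for this example and are not taken from the PR. On ESP-IDF the three values would typically come from `esp_get_free_heap_size()`, `esp_get_minimum_free_heap_size()` and `heap_caps_get_largest_free_block(MALLOC_CAP_DEFAULT)`.

```c
#include <stddef.h>
#include <stdio.h>

/* One periodic reading. On the device these fields would be filled from
 * the ESP-IDF heap APIs; here they are plain numbers so the logic can be
 * checked on a host. */
typedef struct {
    size_t free_bytes;      /* total free heap right now */
    size_t min_free_bytes;  /* lowest free heap seen since boot */
    size_t largest_block;   /* biggest single allocation that could succeed */
} heap_sample_t;

/* Illustrative metric (not from the PR): percentage of free heap that
 * cannot be obtained in one contiguous allocation. 0 means unfragmented. */
static unsigned fragmentation_pct(const heap_sample_t *s)
{
    if (s->free_bytes == 0 || s->largest_block >= s->free_bytes) {
        return 0;
    }
    return (unsigned)(100 - (s->largest_block * 100) / s->free_bytes);
}

/* Formats one log line; returns the snprintf() result. */
static int format_heap_line(char *out, size_t cap, const heap_sample_t *s)
{
    return snprintf(out, cap,
                    "heap free=%zu min_free=%zu largest=%zu frag=%u%%",
                    s->free_bytes, s->min_free_bytes, s->largest_block,
                    fragmentation_pct(s));
}
```

Two samples with the same 100000 bytes free but largest blocks of 90000 and 20000 bytes give 10% and 80% respectively, which is the kind of trend that total free heap alone would not show.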

Also surface the heap figures on the home page: the System table now shows a
Memory row (free / min-free) built from the memory fields already present in
the /status response.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf

esp_mqtt_client_stop() blocks waiting for the MQTT task to acknowledge.
Calling it from WIFI_EVENT_STA_DISCONNECTED runs it on the system event-loop
task, so if the MQTT task was itself stuck on a dead socket during the same
network outage, the stop would wedge the event loop and stall all further
WiFi/IP event processing — a plausible silent-hang path under WiFi flapping.

Follow the documented esp-mqtt pattern instead: start the client once on
first connectivity and let esp-mqtt manage reconnection internally across
WiFi drops and restores. mqtt_client_start() is now idempotent so the
GOT_IP handler can keep calling it on every reconnect as a no-op.
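
The idempotent start can be sketched with a simple guard flag. This is a host-runnable illustration, not the PR's code: `backend_start` is a hypothetical stand-in for the real client start (`esp_mqtt_client_start()` in esp-mqtt), and the `int` return type of `mqtt_client_start` is an assumption.

```c
#include <stdbool.h>

static bool s_started = false;   /* set once the client is running */
static int s_real_starts = 0;    /* counts calls that reached the stand-in */

/* Hypothetical stand-in for the real start call. Returns 0 on success. */
static int backend_start(void)
{
    s_real_starts++;
    return 0;
}

/* Safe to call on every GOT_IP event: only the first successful call
 * starts the client, later calls return immediately. */
int mqtt_client_start(void)
{
    if (s_started) {
        return 0;               /* already running: no-op */
    }
    int rc = backend_start();
    if (rc == 0) {
        s_started = true;       /* latch only on success so a failed start can be retried */
    }
    return rc;
}
```

The flag is not protected by a lock, which is only reasonable if every call comes from the same task, as it would when the handler runs on the system event loop. No stop call is needed on disconnect, since reconnection is left to esp-mqtt.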

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf